diff --git a/docs/guide/equivalence.ipynb b/docs/guide/equivalence.ipynb index 99083d0..41b24ed 100644 --- a/docs/guide/equivalence.ipynb +++ b/docs/guide/equivalence.ipynb @@ -62,8 +62,16 @@ "\n", "TODO: Add a link to the committee note when it is released.\n", "\n", + "There are a number of use cases for which calculating semantic equivalence may be helpful. It can be used for echo detection, in which a STIX producer who consumes content from other producers wants to make sure they are not creating content they have already seen or consuming content they have already created.\n", + "\n", + "Another use case for this functionality is to identify identical or near-identical content, such as a vulnerability shared under three different nicknames by three different STIX producers. A third use case involves a feed that aggregates data from multiple other sources. It will want to make sure that it is not publishing duplicate data.\n", + "\n", "Below we will show examples of the semantic equivalence results of various objects. Unless otherwise specified, the ID of each object will be generated by the library, so the two objects will not have the same ID. This demonstrates that the semantic equivalence algorithm only looks at specific properties for each object type.\n", "\n", + "**Please note** that you will need to install a few extra dependencies in order to use the semantic equivalence functions. You can do this using:\n", + "\n", + "```pip install stix2[semantic]```\n", + "\n", "### Attack Pattern Example\n", "\n", "For Attack Patterns, the only properties that contribute to semantic equivalence are `name` and `external_references`, with weights of 30 and 70, respectively. In this example, both attack patterns have the same external reference but the second has a slightly different yet still similar name." 
@@ -145,7 +153,7 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
85.3\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
91.9\n",
        "
\n" ], "text/plain": [ @@ -270,7 +278,7 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
50.0\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
30.0\n",
        "
\n" ], "text/plain": [ @@ -773,12 +781,12 @@ "source": [ "### Threat Actor Example\n", "\n", - "For Threat Actors, the only properties that contribute to semantic equivalence are `threat_actor_types`, `name`, and `aliases`, with weights of 20, 60, and 20, respectively. In this example, the two threat actors have the same id properties but everything else is different. Since the id property does not factor into semantic equivalence, the result is not very high. The result is not zero because the algorithm is using the Jaro-Winkler distance between strings in the threat_actor_types and name properties." + "For Threat Actors, the only properties that contribute to semantic equivalence are `threat_actor_types`, `name`, and `aliases`, with weights of 20, 60, and 20, respectively. In this example, the two threat actors have the same id properties but everything else is different. Since the id property does not factor into semantic equivalence, the result is not very high. The result is not zero because of the \"Token Sort Ratio\" algorithm used to compare the `name` property." ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 9, "metadata": { "scrolled": true }, @@ -854,14 +862,14 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
33.6\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
6.6000000000000005\n",
        "
\n" ], "text/plain": [ "" ] }, - "execution_count": 5, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -1119,7 +1127,7 @@ "source": [ "### Other Examples\n", "\n", - "Comparing objects of different types will result in an error." + "Comparing objects of different types will result in a `ValueError`." ] }, { @@ -1149,7 +1157,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -1237,7 +1245,7 @@ "" ] }, - "execution_count": 6, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -1295,7 +1303,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can optionally allow comparing across spec versions by providing a configuration dictionary like in the next example:" + "You can optionally allow comparing across spec versions by providing a configuration dictionary that sets `ignore_spec_version`, as in the next example:" ] }, { @@ -1400,162 +1408,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can modify the weights or provide your own functions for comparing objects of a certain type by providing them in a dictionary to the optional 3rd parameter to the semantic equivalence function. You can find functions (like `partial_string_based`) to help with this in the [Environment API docs](../api/stix2.environment.rst#stix2.environment.Environment). In this example we define semantic equivalence for our new `x-foobar` object type:" + "### Detailed Results\n", + "\n", + "If your logging level is set to `DEBUG` (or lower), the function will log more detailed results. These show the semantic equivalence and weighting for each property that is checked, so you can see how the final result was reached." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
60.0\n",
-       "
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def _x_foobar_checks(obj1, obj2, **weights):\n", - " matching_score = 0.0\n", - " sum_weights = 0.0\n", - " if stix2.environment.check_property_present(\"name\", obj1, obj2):\n", - " w = weights[\"name\"]\n", - " sum_weights += w\n", - " matching_score += w * stix2.environment.partial_string_based(obj1[\"name\"], obj2[\"name\"])\n", - " if stix2.environment.check_property_present(\"color\", obj1, obj2):\n", - " w = weights[\"color\"]\n", - " sum_weights += w\n", - " matching_score += w * stix2.environment.partial_string_based(obj1[\"color\"], obj2[\"color\"])\n", - " return matching_score, sum_weights\n", - "\n", - "weights = {\n", - " \"x-foobar\": {\n", - " \"name\": 60,\n", - " \"color\": 40,\n", - " \"method\": _x_foobar_checks,\n", - " },\n", - " \"_internal\": {\n", - " \"ignore_spec_version\": False,\n", - " },\n", - "}\n", - "foo1 = {\n", - " \"type\":\"x-foobar\",\n", - " \"id\":\"x-foobar--0c7b5b88-8ff7-4a4d-aa9d-feb398cd0061\",\n", - " \"name\": \"Zot\",\n", - " \"color\": \"red\",\n", - "}\n", - "foo2 = {\n", - " \"type\":\"x-foobar\",\n", - " \"id\":\"x-foobar--0c7b5b88-8ff7-4a4d-aa9d-feb398cd0061\",\n", - " \"name\": \"Zot\",\n", - " \"color\": \"blue\",\n", - "}\n", - "print(env.semantically_equivalent(foo1, foo2, **weights))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Detailed Results\n", - "\n", - "If your logging level is set to `DEBUG` or higher, the function will log more detailed results. These show the semantic equivalence and weighting for each property that is checked, to show how the final result was arrived at." 
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Starting semantic equivalence process between: 'threat-actor--54dc2aac-6fde-4a68-ae2a-0c0bc575ed70' and 'threat-actor--c51bce3b-a067-4692-ab77-fcdefdd3f157'\n", - "--\t\tpartial_string_based 'Evil Org' 'James Bond'\tresult: '0.56'\n", - "'name' check -- weight: 60, contributing score: 33.6\n", + "Starting semantic equivalence process between: 'threat-actor--664624c7-394e-49ad-ae2a-12f7a48a54a3' and 'threat-actor--1d67719e-6be6-4194-9226-1685986514f5'\n", + "--\t\tpartial_string_based 'Evil Org' 'James Bond'\tresult: '11'\n", + "'name' check -- weight: 60, contributing score: 6.6\n", "--\t\tpartial_list_based '['crime-syndicate']' '['spy']'\tresult: '0.0'\n", "'threat_actor_types' check -- weight: 20, contributing score: 0.0\n", "--\t\tpartial_list_based '['super-evil']' '['007']'\tresult: '0.0'\n", "'aliases' check -- weight: 20, contributing score: 0.0\n", - "Matching Score: 33.6, Sum of Weights: 100.0\n" + "Matching Score: 6.6, Sum of Weights: 100.0\n" ] }, { @@ -1629,14 +1503,14 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
33.6\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
6.6000000000000005\n",
        "
\n" ], "text/plain": [ "" ] }, - "execution_count": 17, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -1657,42 +1531,25 @@ " name=\"James Bond\",\n", " aliases=[\"007\"],\n", ")\n", - "print(env.semantically_equivalent(ta3, ta4))" + "print(env.semantically_equivalent(ta3, ta4))\n", + "\n", + "logger.setLevel(logging.ERROR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Custom Comparisons\n", - "If you wish, you can run your own custom semantic equivalence comparisons. Specifically, you can do any of three things:\n", - " - Provide custom weights for each semantic equivalence contributing property\n", - " - Provide custom comparison functions for individual semantic equivalence contributing properties\n", - " - Provide a custom semantic equivalence method\n", + "You can also retrieve the detailed results in a dictionary so that they can be accessed and used programmatically. The [`semantically_equivalent()`](../api/stix2.environment.rst#stix2.environment.Environment.semantically_equivalent) function takes an optional third argument, called `prop_scores`. This argument should be a dictionary into which the detailed debugging information will be stored.\n", "\n", - "*Some of this has already been explained above, but we will go into more detail here.*\n", + "Using `prop_scores` is simple: pass a dictionary to `semantically_equivalent()`, and after the function has finished executing, the dictionary will contain the various scores. Specifically, it will have the overall `matching_score` and `sum_weights`, along with the weight and contributing score for each of the semantic equivalence contributing properties.\n", "\n", - "#### The `weights` dictionary\n", - "In order to do any of the aforementioned (*optional*) custom comparisons, you will need to provide a `weights` dictionary to the `semantically_equivalent()` method call. 
At a minimum, you must provide the custom weight and custom comparison function for each property. Now, you may use the default weights, or provide your own. You may also use any of the existing comparison functions, or provide your own.\n", - "\n", - "##### Existing comparison functions\n", - "For reference, here is a list of comparison functions already in the codebase (found in stix2/environment.py):\n", - " - `partial_timestamp_based`\n", - " - `partial_list_based`\n", - " - `exact_match`\n", - " - `partial_string_based`\n", - " - `custom_pattern_based`\n", - " - `partial_external_reference_based`\n", - " - `partial_location_distance`\n", - "\n", - "For instance, if we wanted to compare two `ThreatActor`s, but use our own weights, then we could do the following:\n", - "\n", - "(**Please note that if you provide a custom weights dictionary but not a custom semantic equivalence method [shown later], then you must follow the general format shown in the `weights` dict below**)" + "For example:" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -1766,14 +1623,14 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
Using standard weights: 43.6\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
Semantic equivalence score using standard weights: 16.6\n",
        "
\n" ], "text/plain": [ "" ] }, - "execution_count": 7, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, @@ -1848,273 +1705,14 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
Using custom weights: 41.8\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
{'name': {'weight': 60, 'contributing_score': 6.6}, 'threat_actor_types': {'weight': 20, 'contributing_score': 10.0}, 'aliases': {'weight': 20, 'contributing_score': 0.0}, 'matching_score': 16.6, 'sum_weights': 100.0}\n",
        "
\n" ], "text/plain": [ "" ] }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "weights = {\n", - " \"threat-actor\": { # You must specify for which object type this dict is\n", - " \"name\": (30, stix2.environment.partial_string_based), # Each property's value must be a tuple\n", - " \"threat_actor_types\": (50, stix2.environment.partial_list_based), # The 1st component must be the weight\n", - " \"aliases\": (20, stix2.environment.partial_list_based) # The 2nd component must be the comparison function\n", - " }\n", - "}\n", - "\n", - "ta5 = ThreatActor(\n", - " threat_actor_types=[\"crime-syndicate\", \"spy\"],\n", - " name=\"Evil Org\",\n", - " aliases=[\"super-evil\"],\n", - ")\n", - "ta6 = ThreatActor(\n", - " threat_actor_types=[\"spy\"],\n", - " name=\"James Bond\",\n", - " aliases=[\"007\"],\n", - ")\n", - "\n", - "print(\"Using standard weights: %s\" % (env.semantically_equivalent(ta5, ta6)))\n", - "print(\"Using custom weights: %s\" % (env.semantically_equivalent(ta5, ta6, **weights)))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how there is a difference in the semantic equivalence scores, simply due to the fact that custom weights were used.\n", - "\n", - "#### Custom Semantic Equivalence Function\n", - "As said before, you can also write and use your own semantic equivalence method. To do this, you must provide a `weights` dictionary to `semantically_equivalent()`. In this dict, you will provide a key of \"method\" whose value will be your custom semantic equivalence function.\n", - "\n", - "If you provide your own custom semantic equivalence method, you **must also provide the weights for each of the properties** (unless, for some reason, your custom method is weights-agnostic). 
However, since you are writing the custom method, your weights need not necessarily follow the tuple format specified in the above code box.\n", - "\n", - "Here we use our own custom semantic equivalence function to compare two `ThreatActor`s. " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
Using a custom method: 21.263333333333335\n",
-       "
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def custom_semantic_equivalence_method(obj1, obj2, **weights):\n", - " sum_weights = 200.0\n", - " matching_score = 20.19\n", - " for prop in weights:\n", - " if prop != \"method\":\n", - " w = weights[prop][0]\n", - " comp_funct = weights[prop][1]\n", - " contributing_score = w * comp_funct(obj1[prop], obj2[prop])\n", - " sum_weights += w\n", - " matching_score += contributing_score\n", - " return matching_score, sum_weights\n", - "\n", - "\n", - "weights = {\n", - " \"threat-actor\": {\n", - " \"name\": (60, stix2.environment.partial_string_based), # We left each property's value as a tuple\n", - " \"threat_actor_types\": (20, stix2.environment.partial_list_based), # However, weights could be simply numeric\n", - " \"aliases\": (20, stix2.environment.partial_list_based), # They may also be anything else you want\n", - " \"method\": custom_semantic_equivalence_method # As long as your func is written accordingly\n", - " }\n", - "}\n", - "\n", - "print(\"Using a custom method: %s\" % (env.semantically_equivalent(ta5, ta6, **weights)))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice the semantic equivalence score of ~21.26 when using a custom semantic equivalence method to compare `ta5` & `ta6`. Compare this to the semantic equivalence score of 43.6 when using the default semantic equivalence method for comparing `ta5` & `ta6`." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### `prop_scores`\n", - "The `semantically_equivalent()` function now takes an optional third argument, called `prop_scores`. As explained previously, the semantic equivalence functionality includes detailed debugging messages. 
This new argument is meant to be a dictionary that stores those detailed debugging messages so that the debug information can be accessed and used more programatically.\n", - "\n", - "Using `prop_scores` is simple: simply pass in a dictionary to `semantically_equivalent()`, and after the function is done executing, the dict will have the various scores in it. Specifically, it will have the overall `matching_score` and `sum_weights`, along with the weight and contributing score for each of the semantic equivalence contributing properties.\n", - "\n", - "For instance:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
Semantic equivalence score using standard weights: 43.6\n",
-       "
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 9, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, @@ -2189,14 +1787,14 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
Prop: name | weight: 60 | contributing_score: 33.6\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
Prop: name | weight: 60 | contributing_score: 6.6\n",
        "
\n" ], "text/plain": [ "" ] }, - "execution_count": 9, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, @@ -2278,7 +1876,7 @@ "" ] }, - "execution_count": 9, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, @@ -2360,7 +1958,7 @@ "" ] }, - "execution_count": 9, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, @@ -2435,14 +2033,14 @@ ".highlight .vg { color: #19177C } /* Name.Variable.Global */\n", ".highlight .vi { color: #19177C } /* Name.Variable.Instance */\n", ".highlight .vm { color: #19177C } /* Name.Variable.Magic */\n", - ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
matching_score: 43.6\n",
+       ".highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
matching_score: 16.6\n",
        "
\n" ], "text/plain": [ "" ] }, - "execution_count": 9, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, @@ -2524,17 +2122,29 @@ "" ] }, - "execution_count": 9, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "ta5 = ThreatActor(\n", + " threat_actor_types=[\"crime-syndicate\", \"spy\"],\n", + " name=\"Evil Org\",\n", + " aliases=[\"super-evil\"],\n", + ")\n", + "ta6 = ThreatActor(\n", + " threat_actor_types=[\"spy\"],\n", + " name=\"James Bond\",\n", + " aliases=[\"007\"],\n", + ")\n", + "\n", "prop_scores = {}\n", "print(\"Semantic equivalence score using standard weights: %s\" % (env.semantically_equivalent(ta5, ta6, prop_scores)))\n", + "print(prop_scores)\n", "for prop in prop_scores:\n", " if prop not in [\"matching_score\", \"sum_weights\"]:\n", - " print (\"Prop: %s | weight: %s | contributing_score: %s\" % (prop, prop_scores[prop][0], prop_scores[prop][1]))\n", + " print (\"Prop: %s | weight: %s | contributing_score: %s\" % (prop, prop_scores[prop]['weight'], prop_scores[prop]['contributing_score']))\n", " else:\n", " print (\"%s: %s\" % (prop, prop_scores[prop]))" ] @@ -2543,15 +2153,884 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If we wanted, we could have also passed in a custom `weights` dict to the above `semantically_equivalent()` call. If we want to use both `prop_scores` and `weights`, then they would be the third and fourth arguments, respectively, to `sematically_equivalent()`" + "### Custom Comparisons\n", + "If you wish, you can customize semantic equivalence comparisons. 
Specifically, you can do any of three things:\n", + " - Provide custom weights for each semantic equivalence contributing property\n", + " - Provide custom comparison functions for individual semantic equivalence contributing properties\n", + " - Provide a custom semantic equivalence function for a specific object type\n", + "\n", + "#### The `weights` dictionary\n", + "In order to do any of the aforementioned (*optional*) custom comparisons, you will need to provide a `weights` dictionary as the last parameter to the [`semantically_equivalent()`](../api/stix2.environment.rst#stix2.environment.Environment.semantically_equivalent) method call. \n", + "\n", + "The weights dictionary should contain both the weight and the comparison function for each property. You may use the default weights and functions, or provide your own.\n", + "\n", + "##### Existing comparison functions\n", + "For reference, here is a list of the comparison functions already built in the codebase (found in [stix2/environment.py](../api/stix2.environment.rst#stix2.environment.Environment)):\n", + " - [`custom_pattern_based`](../api/stix2.environment.rst#stix2.environment.custom_pattern_based)\n", + " - [`exact_match`](../api/stix2.environment.rst#stix2.environment.exact_match)\n", + " - [`partial_external_reference_based`](../api/stix2.environment.rst#stix2.environment.partial_external_reference_based)\n", + " - [`partial_list_based`](../api/stix2.environment.rst#stix2.environment.partial_list_based)\n", + " - [`partial_location_distance`](../api/stix2.environment.rst#stix2.environment.partial_location_distance)\n", + " - [`partial_string_based`](../api/stix2.environment.rst#stix2.environment.partial_string_based)\n", + " - [`partial_timestamp_based`](../api/stix2.environment.rst#stix2.environment.partial_timestamp_based)\n", + "\n", + "For instance, if we wanted to compare two of the `ThreatActor`s from before, but use our own weights, then we could do the following:" ] }, { "cell_type": 
"code", - "execution_count": null, + "execution_count": 19, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
Using standard weights: 16.6\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "text/html": [ + "
Using custom weights: 28.300000000000004\n",
+       "
\n" ], "text/plain": [ + "" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "weights = {\n", + " \"threat-actor\": { # You must specify the object type\n", + " \"name\": (30, stix2.environment.partial_string_based), # Each property's value must be a tuple\n", + " \"threat_actor_types\": (50, stix2.environment.partial_list_based), # The 1st component must be the weight\n", + " \"aliases\": (20, stix2.environment.partial_list_based) # The 2nd component must be the comparison function\n", + " }\n", + "}\n", + "\n", + "print(\"Using standard weights: %s\" % (env.semantically_equivalent(ta5, ta6)))\n", + "print(\"Using custom weights: %s\" % (env.semantically_equivalent(ta5, ta6, **weights)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the semantic equivalence scores differ simply because custom weights were used.\n", + "\n", + "#### Custom Weights With prop_scores\n", + "If we want to use both `prop_scores` and `weights`, we pass them as the third and fourth arguments, respectively, to `semantically_equivalent()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9.95" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "text/html": [ + "
{'name': {'weight': 45, 'contributing_score': 4.95}, 'threat_actor_types': {'weight': 10, 'contributing_score': 5.0}, 'aliases': {'weight': 45, 'contributing_score': 0.0}, 'matching_score': 9.95, 'sum_weights': 100.0}\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prop_scores = {}\n", + "weights = {\n", + " \"threat-actor\": {\n", + " \"name\": (45, stix2.environment.partial_string_based),\n", + " \"threat_actor_types\": (10, stix2.environment.partial_list_based),\n", + " \"aliases\": (45, stix2.environment.partial_list_based),\n", + " },\n", + "}\n", + "env.semantically_equivalent(ta5, ta6, prop_scores, **weights)\n", + "print(prop_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Custom Semantic Equivalence Functions\n", + "You can also write and use your own semantic equivalence functions. In the examples above, you could replace the built-in comparison functions for any or all properties. For example, here we use a custom string comparison function just for the `'name'` property:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Using custom string comparison: 5.0\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def my_string_compare(p1, p2):\n", + " if p1 == p2:\n", + " return 1\n", + " else:\n", + " return 0\n", + " \n", + "weights = {\n", + " \"threat-actor\": {\n", + " \"name\": (45, my_string_compare),\n", + " \"threat_actor_types\": (10, stix2.environment.partial_list_based),\n", + " \"aliases\": (45, stix2.environment.partial_list_based),\n", + " },\n", + "}\n", + "print(\"Using custom string comparison: %s\" % (env.semantically_equivalent(ta5, ta6, **weights)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also customize the comparison of an entire object type instead of just how each property is compared. To do this, provide a `weights` dictionary to `semantically_equivalent()` and in this dictionary include a key of `\"method\"` whose value is your custom semantic equivalence function for that object type.\n", + "\n", + "If you provide your own custom semantic equivalence method, you **must also provide the weights for each of the properties** (unless, for some reason, your custom method is weights-agnostic). However, since you are writing the custom method, your weights need not necessarily follow the tuple format specified in the above code box.\n", + "\n", + "Note also that if you want detailed results with `prop_scores` you will need to implement that in your custom function, but you are not required to do so.\n", + "\n", + "In this next example we use our own custom semantic equivalence function to compare two `ThreatActor`s, and do not support `prop_scores`." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Using standard weights: 16.6\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "text/html": [ + "
Using a custom method: 6.6000000000000005\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def custom_semantic_equivalence_method(obj1, obj2, **weights):\n", + " sum_weights = 0\n", + " matching_score = 0\n", + " # Compare name\n", + " w = weights['name']\n", + " sum_weights += w\n", + " contributing_score = w * stix2.environment.partial_string_based(obj1['name'], obj2['name'])\n", + " matching_score += contributing_score\n", + " # Compare aliases only for spies\n", + " if 'spy' in obj1['threat_actor_types'] + obj2['threat_actor_types']:\n", + " w = weights['aliases']\n", + " sum_weights += w\n", + " contributing_score = w * stix2.environment.partial_list_based(obj1['aliases'], obj2['aliases'])\n", + " matching_score += contributing_score\n", + " \n", + " return matching_score, sum_weights\n", + "\n", + "weights = {\n", + " \"threat-actor\": {\n", + " \"name\": 60,\n", + " \"aliases\": 40,\n", + " \"method\": custom_semantic_equivalence_method\n", + " }\n", + "}\n", + "\n", + "print(\"Using standard weights: %s\" % (env.semantically_equivalent(ta5, ta6)))\n", + "print(\"Using a custom method: %s\" % (env.semantically_equivalent(ta5, ta6, **weights)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also write custom functions for comparing objects of your own custom types. Like in the previous example, you can use the built-in functions listed above to help with this, or write your own. In the following example we define semantic equivalence for our new `x-foobar` object type. Notice that this time we have included support for detailed results with `prop_scores`." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
71.6\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "text/html": [ + "
{'name': (60, 60.0), 'color': (40, 11.6), 'matching_score': 71.6, 'sum_weights': 100.0}\n",
+       "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def _x_foobar_checks(obj1, obj2, prop_scores, **weights):\n", + " matching_score = 0.0\n", + " sum_weights = 0.0\n", + " if stix2.environment.check_property_present(\"name\", obj1, obj2):\n", + " w = weights[\"name\"]\n", + " sum_weights += w\n", + " contributing_score = w * stix2.environment.partial_string_based(obj1[\"name\"], obj2[\"name\"])\n", + " matching_score += contributing_score\n", + " prop_scores[\"name\"] = (w, contributing_score)\n", + " if stix2.environment.check_property_present(\"color\", obj1, obj2):\n", + " w = weights[\"color\"]\n", + " sum_weights += w\n", + " contributing_score = w * stix2.environment.partial_string_based(obj1[\"color\"], obj2[\"color\"])\n", + " matching_score += contributing_score\n", + " prop_scores[\"color\"] = (w, contributing_score)\n", + " \n", + " prop_scores[\"matching_score\"] = matching_score\n", + " prop_scores[\"sum_weights\"] = sum_weights\n", + " return matching_score, sum_weights\n", + "\n", + "prop_scores = {}\n", + "weights = {\n", + " \"x-foobar\": {\n", + " \"name\": 60,\n", + " \"color\": 40,\n", + " \"method\": _x_foobar_checks,\n", + " },\n", + " \"_internal\": {\n", + " \"ignore_spec_version\": False,\n", + " },\n", + "}\n", + "foo1 = {\n", + " \"type\":\"x-foobar\",\n", + " \"id\":\"x-foobar--0c7b5b88-8ff7-4a4d-aa9d-feb398cd0061\",\n", + " \"name\": \"Zot\",\n", + " \"color\": \"red\",\n", + "}\n", + "foo2 = {\n", + " \"type\":\"x-foobar\",\n", + " \"id\":\"x-foobar--0c7b5b88-8ff7-4a4d-aa9d-feb398cd0061\",\n", + " \"name\": \"Zot\",\n", + " \"color\": \"blue\",\n", + "}\n", + "print(env.semantically_equivalent(foo1, foo2, prop_scores, **weights))\n", + "print(prop_scores)" + ] } ], "metadata": { @@ -2570,7 +3049,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.6.3" } }, "nbformat": 4,