Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash_tagging: Undefined behavior within Python threading can cause incomplete processing #4050

Closed
william-billaud opened this issue Feb 17, 2022 · 8 comments · Fixed by #4238
Assignees
Labels
analysis Issue related to analysis plugins

Comments

@william-billaud
Copy link
Contributor

Description of problem:

In the hash_plugins analyzer, threads are not instantiated in the same process as the one in which they are executed. This pattern causes an undefined behavior of the is_alive function, depending on the OS/Python version (see snippets below)

Therefore, in hash_tagging plugins, if the analysis queue is empty but the analyzer still has work to do, the analyzer will be killed, resulting in a partially executed task (for example, when the plaso database is really small, the analysis is not executed).

In my opinion, the best option is to instantiate the thread class just before it starts

if not self._analyzer_started:
self._analyzer.start()
self._analyzer_started = True
rather than in init
self._analyzer = analyzer_class(self._hash_queue, self._hash_analysis_queue)

.The main disadvantage is that the TestConnection function called in the cli.helper.* classes would have to call class/static method.

Further description of the undefinied beahviour :

For example the following snippet (from https://stackoverflow.com/questions/57814933/is-alive-always-returns-false-when-called-on-a-thread-from-inside-multiprocess) result in different result deping of the os :

  • Ubuntu 20.04/Python 3.8 -> worker.is_alive() always return False.
  • Mac os (test even if not supported by plaso)/Python 3.9 -> TypeError: cannot pickle '_thread.lock' object
  • Centos/Python 3.6 -> worker.is_alive() works as expected.
from threading import Thread
from multiprocessing import Process
import time
class WorkerThread(Thread):
    def run(self):
        i = 0
        while i < 10:
            print ("worker running..")
            time.sleep(1)
            i += 1



class ProcessClass:
    def run_worker(self, worker):
        self.worker = worker
        self.worker.daemon = True
        self.worker.start()
        i = 0
        while i < 12:
            print (f"Is worker thread alive? {self.worker.is_alive()}")
            i += 1
            time.sleep(1)

            
worker = WorkerThread()
processclass = ProcessClass()
parentProcess = Process(target = processclass.run_worker, args = (worker,))
parentProcess.start()

Command line and arguments:

psort.py --analysis nsrlsvr test.plaso

Plaso version:

  • 20211229

Operating system Plaso is running on:

  • Ubuntu 20.04

Installation method:

  • installed from [GiFT PPA][https://launchpad.net/~gift] stable track
  • Manually
@joachimmetz joachimmetz added the needs closer look Issue that requires further analysis by a maintainer label Feb 17, 2022
@joachimmetz
Copy link
Member

@william-billaud thx for the report, will take a closer look when time permits.

@joachimmetz joachimmetz changed the title Undefined behavior within python can cause incomplete data processing by the hash_analysis plugins. hash_analysis: Undefined behavior within Python threading can cause incomplete processing Feb 17, 2022
@joachimmetz joachimmetz changed the title hash_analysis: Undefined behavior within Python threading can cause incomplete processing hash_tagging\: Undefined behavior within Python threading can cause incomplete processing Feb 17, 2022
@joachimmetz joachimmetz changed the title hash_tagging\: Undefined behavior within Python threading can cause incomplete processing hash_tagging: Undefined behavior within Python threading can cause incomplete processing Feb 17, 2022
@adulau
Copy link

adulau commented Jun 7, 2022

It would be nice if you can handle the issue. It would help to support hashlookup integration in plaso.

@joachimmetz
Copy link
Member

@adulau limited bandwidth atm, what is "hashlookup integration" ?

@joachimmetz
Copy link
Member

joachimmetz commented Jun 7, 2022

@adulau
Copy link

adulau commented Jun 8, 2022

Yes but there is no need to use the libraries mentioned and the server to do test and even use it.

I wrote an example in BSD 2-clauses using the Bloom filter provided to avoid remote lookup.

https://github.com/hashlookup/hashlookup-forensic-analyser

Let me know if there is any issue.

@joachimmetz
Copy link
Member

As long as the API stays the same of the server and what the end-to-end tests use
and no incompatible licenses are introduced the solution should be fine

@adulau
Copy link

adulau commented Aug 23, 2022

I suppose we can close it.

@joachimmetz
Copy link
Member

I suppose we can close it.

Why ?

@joachimmetz joachimmetz added analysis Issue related to analysis plugins and removed needs closer look Issue that requires further analysis by a maintainer labels Sep 12, 2022
@joachimmetz joachimmetz self-assigned this Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Issue related to analysis plugins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants