Find and count strings of text from multiple files


Joined
Dec 14, 2021
Messages
14
Reaction score
0
Hello all,
I would like to see if someone has a solution for a problem that's been giving me a headache for some time now:
I have a text file with "patterns" I'd like to find and count in a target file:

FileA.txt (patterns)
apple
orange
cherry
melon
watermelon
tangerine
kiwi

FileB.txt (target)
red apples
green apples
black cherry
red cherry
green melon
watermelon
tangerine

I want to generate an output file that would read:
FileB_counts.txt
apple 2
orange 0
cherry 2
melon 2
watermelon 1
tangerine 1
kiwi 0

This is easy in Excel using COUNTIF; however, my files have hundreds of thousands of "patterns" and millions of lines in the target file. I have to leave the computer running all night, and it does finish, but it just takes way too much time. I'm sure there has to be a better way using either Unix/bash or a Python program.

I tried using bash commands in ubuntu:

grep -cf FileA.txt FileB.txt >> output.txt

but the output file is just a single number ("8"), and I really need to see which patterns were found and how many times each.
I included "melon" and "watermelon" on purpose to show that a match does not need to be a whole word; it can be a string within a longer word.

I hope someone has experience and can give me some guidance on how to solve this.

Thanks!
 
Joined
Mar 3, 2021
Messages
243
Reaction score
29
Sure, give this a shot. It takes two arguments: the file with the patterns, then the target file. You can hardcode those if it's easier; just replace the sys.argv[#] at the bottom. If you still need more skookum, this can easily be done in C/C++.

Python:
#!/usr/bin/env python3

import sys

def main(pattern_file, target_file):
    patterns = None
    with open(pattern_file) as f:
        patterns = dict((l.strip(), 0) for l in f)  # pattern -> count, starting at 0

    with open(target_file) as f:
        for l in f:  # each line of the target file
            for k in patterns.keys():
                if k in l:  # plain substring test, so "melon" also hits "watermelon"
                    patterns[k] += 1
    for k, v in patterns.items():
        print(f"{k} {v}")

if __name__ == '__main__':
    sys.exit(main(sys.argv[1], sys.argv[2]))
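For anyone following along, here's a quick sanity check of the same logic against the sample data from the first post. It's only a sketch: `count_matching_lines` is a name I made up, the data sits in lists instead of files, and I skip blank pattern lines because an empty string is "in" every line and would be counted for all of them.

```python
from collections import OrderedDict

def count_matching_lines(patterns, lines):
    """Count, for each pattern, how many lines contain it as a substring."""
    # OrderedDict keeps the patterns in input order, even on Python 2.
    counts = OrderedDict()
    for p in patterns:
        p = p.strip()
        if p:  # skip blank lines: "" is a substring of every line
            counts[p] = 0
    for line in lines:
        for p in counts:
            if p in line:
                counts[p] += 1
    return counts

# Sample data copied from FileA.txt and FileB.txt in the question.
patterns = ["apple", "orange", "cherry", "melon", "watermelon", "tangerine", "kiwi"]
target = ["red apples", "green apples", "black cherry", "red cherry",
          "green melon", "watermelon", "tangerine"]

for pattern, count in count_matching_lines(patterns, target).items():
    print("%s %s" % (pattern, count))
```

Note that this counts lines that contain a pattern, not total occurrences, so a line with the same pattern twice still adds only 1.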
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
Thanks! I have a long way to go before I can figure out these things by myself.

I named my files pattern_file and target_file so as not to change the code (I left off the extension because adding it gives me an error). Is that OK? No quotation marks either?

When I run it, it gives me this error message on line 16:

print(f"{k} {v}")
^
SyntaxError: invalid syntax


The caret points at the second quotation mark. What am I doing wrong?
 

Attachments

  • syntaxerror.jpg
    syntaxerror.jpg
    30.2 KB · Views: 11
Joined
Mar 3, 2021
Messages
243
Reaction score
29
Weird, what version of Python do you have? That style, using an f-string, was added in Python 3.6. Try replacing it with an older style of string interpolation.

Python:
print("%s %s" % (k, v))
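In case it helps, here's a small side-by-side of the formatting styles. The f-string version needs 3.6 or newer, so I've left it as a comment; the other two should run on 2.7 and 3.x alike. `k` and `v` are just made-up sample values.

```python
import sys

k, v = "apple", 2  # sample values for illustration only

# f-string: Python 3.6+ only, a SyntaxError on anything older
# line_f = f"{k} {v}"

# %-interpolation: works on Python 2 and 3
line_percent = "%s %s" % (k, v)

# str.format with explicit indexes: also works on Python 2 and 3
line_format = "{0} {1}".format(k, v)

print("Python %d.%d" % sys.version_info[:2])
print(line_percent)
print(line_format)
```

Running `python --version` in the terminal tells you which interpreter the plain `python` command starts.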
 
Joined
Mar 3, 2021
Messages
243
Reaction score
29
If it still explodes, post the code you're running. Something might've gotten lost on the wire.
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
OK, so I actually have Python 2.7.

I amended the original code and now this is what I get:

line 19, in <module>
sys.exit(main(sys.argv[1], sys.argv[2]))
IndexError: list index out of range

This is what I'm doing, and I apologize in advance for any obvious mistakes I'm making:

I saved the code in a .py file (named code.py) in my c/destination folder.

I also saved these in the destination folder:

pattern_file.txt (text file with patterns)

target_file.txt (text file with target lines)

Then in Ubuntu I run:

python c/destination folder/code.py >> output.txt

This is the Python script in the .py file. I didn't change anything except the end:

#!/usr/bin/env python3

import sys

def main(pattern_file, target_file):
    patterns = None
    with open(pattern_file) as f:
        patterns = dict((l.strip(), 0) for l in f)

    with open(target_file) as f:
        for l in f:
            for k in patterns.keys():
                if k in l:
                    patterns[k] += 1
    for k, v in patterns.items():
        print("%s %s" % (k, v))

if __name__ == '__main__':
    sys.exit(main(sys.argv[1], sys.argv[2]))

When I tried to change the names of the files in the code, it gave me errors again. I'm guessing I should have been including the names somewhere; I just don't know where.
 
Joined
Mar 3, 2021
Messages
243
Reaction score
29
I'm not 100% sure it'll run on Python 2, but we can give it a shot. I'll update it tomorrow if it can't. When you run it, give it the paths of the files.

Code:
python c/destination folder/code.py path_to_patterns_file path_to_target_file >> output.txt
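The IndexError comes from `sys.argv` holding only the script name when no file paths are given, so `sys.argv[1]` doesn't exist. If you'd rather get a usage message than a traceback, something like this could replace the `sys.argv` indexing. It's a sketch, not a drop-in: `parse_args` is a name I made up, and the file names in the example call are placeholders.

```python
import argparse

def parse_args(argv=None):
    """Parse the two required paths; argparse prints usage and exits if one is missing."""
    parser = argparse.ArgumentParser(
        description="Count how many lines of a target file contain each pattern.")
    parser.add_argument("pattern_file", help="text file with one pattern per line")
    parser.add_argument("target_file", help="text file to search")
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    return parser.parse_args(argv)

# Example with explicit placeholder arguments instead of the real command line.
args = parse_args(["pattern_file.txt", "target_file.txt"])
print("%s %s" % (args.pattern_file, args.target_file))
```

One more thing to watch: if a path contains a space, wrap it in quotes on the command line, otherwise the shell splits it into two arguments.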
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
That did the trick!

python c/destination folder/code.py path_to_patterns_file path_to_target_file >> output.txt

I was not summoning the files (duh!)

Thanks a lot. I'll let you know whether the other code works once I update Python.
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
My pattern file has wildcards. In Excel I use (?) and it works, but the Python code returns 0 matches for the patterns with wildcards. I tried (?), (.) and (*), and all return 0 on those patterns. How can I include wildcards in the pattern dictionary?
 
Joined
Mar 3, 2021
Messages
243
Reaction score
29
For that, we'll have to turn to regular expressions. It'll be a little slower, but what can ya do. You get all the power of regular expressions, which is considerable: a period now matches any single character, and you can extend the patterns to do a whole lot more. Keep in mind that other special characters in your patterns, such as ( ) * ? +, are now read as regex syntax too. Tested with Python 2.7 this time. One difference between Python 2 and 3 on this one is that, in 2, the patterns printed out will not be in the same order as they were in the patterns file. The original order can easily be kept with an OrderedDict, though, if desired.

Python:
#!/usr/bin/env python2

import re
import sys

def main(pattern_file, target_file):
    patterns = None
    with open(pattern_file) as f:
        patterns = dict((re.compile(l.strip()), 0) for l in f)  # compiled regex -> count

    with open(target_file) as f:
        for l in f:
            for k in patterns.keys():
                if k.search(l):  # regex match anywhere in the line
                    patterns[k] += 1
    for k, v in patterns.items():
        print "%s %s" % (k.pattern, v)

if __name__ == '__main__':
    sys.exit(main(sys.argv[1], sys.argv[2]))
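If you'd rather keep the Excel-style wildcards in the pattern file as they are, a small translation step is one option. This is a sketch under some assumptions: that your patterns use the Excel conventions where ? is exactly one character and * is any run of characters, and that everything else should match literally. `wildcard_to_regex` and `count_matching_lines` are names I made up, and Excel's ~ escape is not handled. It also uses an OrderedDict so the output keeps the order of the pattern file.

```python
import re
from collections import OrderedDict

def wildcard_to_regex(pattern):
    """Translate Excel-style ? and * wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts))

def count_matching_lines(raw_patterns, lines):
    """Count, per pattern, the lines it matches; results keep input order."""
    compiled = OrderedDict()
    counts = OrderedDict()
    for raw in raw_patterns:
        raw = raw.strip()
        if raw:  # skip blank pattern lines
            compiled[raw] = wildcard_to_regex(raw)
            counts[raw] = 0
    for line in lines:
        line = line.rstrip("\n")
        for raw, rx in compiled.items():
            if rx.search(line):
                counts[raw] += 1
    return counts

# Made-up sample data to show the wildcard behaviour.
sample_patterns = ["c?erry", "water*on", "1.5"]
sample_lines = ["black cherry", "watermelon", "105", "1.5 kg"]
for raw, n in count_matching_lines(sample_patterns, sample_lines).items():
    print("%s %s" % (raw, n))
```

The trade-off is that you give up full regex syntax in the pattern file: a period in a pattern only matches a literal period here, which is why "1.5" does not match "105".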
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
Yes! I noticed the patterns came out in a different order in the output, but that is not an issue for me. Thanks again for your amazing help! I'll give this one a try and let you know how it goes.
 
Joined
Mar 3, 2021
Messages
243
Reaction score
29
There are a lot of factors in that. It's likely that Excel has both files entirely cached, in memory, and can multithread it. And MATCH (I assume that's what you're using) might be faster than regular expressions, as it has less functionality. If you're willing and able to compile a C++ program, I can try it in that, with threads and all. If you have larger test files I can use, I can benchmark them a little better.
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
Hi!
Sorry, I went on vacation and didn't check my email. I really have no experience with C++. I can share some files if you want to play with them a little.
 
Joined
Dec 14, 2021
Messages
14
Reaction score
0
Here are 3 files:
One with the patterns (with wildcards) and two target files. Ideally, I would like to run several files at once and generate a file with the list of patterns on the target files.
 

Attachments

  • Target_1.zip
    408.6 KB · Views: 10
  • Target_2.zip
    703.5 KB · Views: 10
  • Patterns.zip
    369.5 KB · Views: 8
Joined
Dec 14, 2021
Messages
14
Reaction score
0
Correction to my last post: it should be more like "generate a file with the number of patterns found in each file".
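One way the multi-file version could look, as a rough sketch only: it treats each pattern line as a regular expression, like the last script in this thread, and writes one tab-separated row per pattern with one count column per target file. The function names and the output layout are my own assumptions, not something from the thread.

```python
import re

def count_in_file(compiled, path):
    """Return one count per compiled pattern: the lines of `path` it matches."""
    counts = [0] * len(compiled)
    with open(path) as f:
        for line in f:
            for i, rx in enumerate(compiled):
                if rx.search(line):
                    counts[i] += 1
    return counts

def write_counts_table(pattern_path, target_paths, out_path):
    """Write a tab-separated table: one row per pattern, one column per target."""
    with open(pattern_path) as f:
        raw = [l.strip() for l in f if l.strip()]  # drop blank pattern lines
    compiled = [re.compile(p) for p in raw]
    columns = [count_in_file(compiled, t) for t in target_paths]
    with open(out_path, "w") as out:
        out.write("pattern\t" + "\t".join(target_paths) + "\n")
        for i, p in enumerate(raw):
            row = [p] + [str(col[i]) for col in columns]
            out.write("\t".join(row) + "\n")

# Hypothetical call; these file names are placeholders, not the real attachments:
# write_counts_table("patterns.txt", ["target_1.txt", "target_2.txt"], "counts.tsv")
```

The resulting file opens directly in Excel, with the patterns in the first column and one column of counts per target file.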
 
