tag:blogger.com,1999:blog-5236867476487043111.post2710443668646793486..comments2013-05-02T10:24:47.568-04:00Comments on Corey Goldberg: Python - Counting Items In A ListCorey Goldbergnoreply@blogger.comBlogger22125tag:blogger.com,1999:blog-5236867476487043111.post-18967670045457943992011-05-21T00:34:40.959-04:002011-05-21T00:34:40.959-04:00for item in list:
try:
item += 1
e...for item in list:<br /> try:<br /> item += 1<br /> except KeyError:<br /> item = 1Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-75714533660114583842010-07-01T11:38:05.336-04:002010-07-01T11:38:05.336-04:00Very interesting,
So is there a consensus for a ...Very interesting,<br /><br /> So is there a consensus for a faster method assuming large datasets?<br /><br /> ThanksAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-22546204815177520112008-08-06T15:26:00.000-04:002008-08-06T15:26:00.000-04:00>with the exception that set-based >methods ...>with the exception that set-based >methods were uniformly bad<BR/><BR/>Actually the set method worked relatively well when there are only a couple of possible values. The python for loops only twice and the set and count ones are at C speed<BR/><BR/>For a group of very sparse values the ternary is fastest. I guess because for most cases it does a plain d[v]=1 instead of the d[v]=0;d[v]+=1 dance.<BR/><BR/>The lesson here is that you can make any version of the algorithm look good or bad by choosing the data to feed them.nesnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-44893458250469459682008-07-28T17:25:00.000-04:002008-07-28T17:25:00.000-04:00The test results I got were pretty interesting, an...The test results I got were pretty interesting, and kind of all over the map, with the exception that set-based methods were uniformly bad:<BR/><BR/> test narrow wide vwide<BR/> dictget 0.231 0.278 0.283<BR/> defaultd 0.130 0.326 0.419<BR/> itertool 0.145 1.257 1.967<BR/> makeset 0.056 11.474 44.859<BR/> ternary 0.152 0.220 0.148<BR/> fp 0.073 11.524 44.994<BR/> generator 0.053 11.705 45.133<BR/><BR/>I put up the script I used to get these results at: <A HREF="http://billmill.org/static/files/uniq_count_bench.py" REL="nofollow">http://billmill.org/static/files/uniq_count_bench.py</A>.Bill Millhttp://www.blogger.com/profile/10065077215311205545noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-16232878253460116642008-07-28T14:35:00.000-04:002008-07-28T14:35:00.000-04:00Hi Corey,Interesting exercise.Could be done this w...Hi Corey,<BR/><BR/>Interesting exercise.<BR/><BR/>Could be done this way too:<BR/><BR/>foo = [1,2,1,2,1,2,1,1]<BR/>d = {}<BR/>for v in foo: d[v] = 0<BR/>for v in foo: d[v] += 1<BR/><BR/>Can also be achived by simulating ternary operator:<BR/><BR/>foo = [1,2,1,2,1,2,1,1]<BR/>d = {}<BR/>for v in foo:<BR/> d[v] = d.has_key(v) and (d[v] + 1) or 1<BR/><BR/>Regards,<BR/><A HREF="http://www.testingperspective.com" REL="nofollow">Rahul Verma</A>Rahul Vermahttp://www.blogger.com/profile/15369178470521588425noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-89896646799983260862008-07-28T10:05:00.000-04:002008-07-28T10:05:00.000-04:00I prefer the defaultdict(int);d[k]+=1 accumulatio...I prefer the defaultdict(int);d[k]+=1 accumulation algorithm. I can follow it the easiest because it doesn't create additional temporary data structures.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-2359724227772961602008-07-27T10:56:00.000-04:002008-07-27T10:56:00.000-04:00>>> for v in foo:... d[v] = d.get(v,0)+...>>> for v in foo:<BR/>... d[v] = d.get(v,0)+1<BR/><BR/><BR/>No conversions required, a single pass through the iterable, no special foo required. And, while I haven't benched it, I would be surprised if any of the other variants are faster.Jeff Hnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-78586453223262334732008-07-26T10:58:00.000-04:002008-07-26T10:58:00.000-04:00I'll benchmark some of these different solutions w...I'll benchmark some of these different solutions when I get a chance.<BR/><BR/>Interesting stuff.Corey Goldberghttp://www.blogger.com/profile/06219872951977664560noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-33771253188002237562008-07-26T06:12:00.000-04:002008-07-26T06:12:00.000-04:00"But I do not agree with frederik, the dictionary ..."But I do not agree with frederik, the dictionary lookup is not O(1). I don't know how it's implemented in python, but I don't think they solve this in less than O(log(n))"<BR/><BR/>Python dictionaries and sets are hash tables, with open addressing and overallocation to keep the load factor below 2/3. Hash and perturbation algorithms are carefully chosen to minimize the number of collisions for a large number of real-life scenarios. In other words, O(1) lookup and amortized O(1) growth in practice. No trees anywhere in sight.Fredrikhttp://effbot.orgnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-49057800652348744222008-07-25T22:17:00.000-04:002008-07-25T22:17:00.000-04:00There is something that anoy me, the set(foo) is a...There is something that anoy me, the set(foo) is a really complex operation.<BR/><BR/>It need O(n) to iterate over the list and (if the set is binary tree or something like that) O(log(m)) to locate if the value is in the tree and insert it. (less when the tree is not full and more when the tree is not balanced)<BR/><BR/>So the first set(foo) is O(nlog(m)). The iteration is O(m), the count is O(n) and the insertion in the list is O(1).<BR/><BR/>Then you have something such as O(mn + nlog(m)). Like frederik said, it's ~O(mn). (but with small values of m and n, it's not really the same thing.. 2*infinty == infinity, but 2 hours is far more than one ;)<BR/><BR/>But I do not agree with frederik, the dictionary lookup is not O(1). I don't know how it's implemented in python, but I don't think they solve this in less than O(log(n))<BR/><BR/>For me the frederik solution is the better, and it's O(nlog(m)) in the worst case (O(n) for iteration and O(log(m)) for the lookup and incrementation)<BR/><BR/>The problem with all that mathematical stuff it's that it become true for large value of n/m and does not take into account the difference between C builtins and python code, which can be enormous for little values of n/m)<BR/><BR/>;)Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-84516730583460571742008-07-25T19:50:00.000-04:002008-07-25T19:50:00.000-04:00I'll go with your original or the defaultdict solu...I'll go with your original or the defaultdict solution. Why? Because they run fast (see frederic) and they are the most readable solutions.<BR/><BR/>Using map, zip and set in combination is nice for academic questions, but in cases where you have to share the code with others, it most often leads to misunderstandings and bugs in the codePeternoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-5518785756770401262008-07-25T18:39:00.000-04:002008-07-25T18:39:00.000-04:00on what machines can the original possibly run fas...on what machines can the original possibly run faster?kelsonnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-19485469428870106212008-07-25T17:55:00.000-04:002008-07-25T17:55:00.000-04:00Dude, your original snippet is by far the best sol...Dude, your original snippet is by far the best solution here. Easy to read, works. I'll take that over hard to read, runs faster.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-69519827309686023652008-07-25T16:56:00.001-04:002008-07-25T16:56:00.001-04:00(where "additions" in my comment meant "additions ...(where "additions" in my comment meant "additions to the dictionary", not "additions to the counter")<BR/><BR/>As for Walter's "groupby" solution, it combines an O(n log n) sort with at three O(n) passes over the data (copy list, groupby, getting the length of each group operator), but these operations are all written in C, so I wouldn't be surprised if the approach could beat a single-pass O(n) python solution for also for relatively large n:s. Worth benchmarking, at least.Fredriknoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-28210691204063472032008-07-25T16:56:00.000-04:002008-07-25T16:56:00.000-04:00Thanks for the great tips. I will update the post...Thanks for the great tips. <BR/><BR/>I will update the post with some of these better approaches.Corey Goldberghttp://www.blogger.com/profile/06219872951977664560noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-78658171899265061992008-07-25T16:36:00.000-04:002008-07-25T16:36:00.000-04:00you wanted O(n)...foo = [1, 2, 1, 2, 1, 2, 1, 1]d ...you wanted O(n)...<BR/><BR/>foo = [1, 2, 1, 2, 1, 2, 1, 1]<BR/>d = {}<BR/>[d.__setitem__(f,1+d.get(f,0)) for f in foo]kelsonnoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-57566843002283112522008-07-25T15:39:00.000-04:002008-07-25T15:39:00.000-04:00What about itertools?from itertools import groupby...What about itertools?<BR/><BR/>from itertools import groupby<BR/>from operator import itemgetter<BR/>foo = [1, 2, 1, 2, 1, 2, 1, 1]<BR/>x = dict([(a,len(list(b))) for a,b in groupby(sorted(foo))])<BR/>print(x)Walter Cruzhttp://www.blogger.com/profile/10836863572075307104noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-15111100717725287342008-07-25T15:08:00.000-04:002008-07-25T15:08:00.000-04:00d = defaultdict(int) is loads better than d = defa...d = defaultdict(int) is loads better than d = defaultdict(lambda: 0). I knew there must have been a shortcut for that!Bill Millhttp://www.blogger.com/profile/10065077215311205545noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-55534246663920518392008-07-25T15:05:00.000-04:002008-07-25T15:05:00.000-04:00To demonstrate Fredrik's proposed solution:In ...To demonstrate Fredrik's proposed solution:<BR/><BR/><BR/>In [6]: x = [1,1,2,2,2,2,2,1,1,2,2,1,1,1,2]<BR/><BR/>In [7]: d = defaultdict(lambda: 0)<BR/><BR/>In [8]: for y in x: d[y] += 1<BR/><BR/>In [9]: d<BR/><BR/>Out[9]: defaultdict(<function <lambda> at 0x1123e30>, {1: 7, 2: 8})Bill Millhttp://www.blogger.com/profile/10065077215311205545noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-79603889050514881242008-07-25T14:30:00.000-04:002008-07-25T14:30:00.000-04:00The count method is O(n), so you end up with O(n*m...The count method is O(n), so you end up with O(n*m) performance, where m is the size of the alphabet. Or in other words, your solution can get really slow really quick.<BR/><BR/>A more robust solution is to use a dictionary containing counters (e.g. "defaultdict(int)") and then do a single pass over the data. Since dictionary lookups are O(1) and additions are amortized O(1), the resulting algorithn is O(n).<BR/><BR/>(If you *know* that you have a list of small integers in a limited range, like in your example, you can of course use a plain list instead of a dictionary).Fredriknoreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-47180670035816447382008-07-25T14:29:00.000-04:002008-07-25T14:29:00.000-04:00This comment has been removed by the author.Roger Demetrescuhttp://www.blogger.com/profile/08741324019734334286noreply@blogger.comtag:blogger.com,1999:blog-5236867476487043111.post-81996997626726032852008-07-25T14:19:00.000-04:002008-07-25T14:19:00.000-04:00maybe not super pythonic, but FP-ythonic:bar = dic...maybe not super pythonic, but FP-ythonic:<BR/><BR/>bar = dict ( <BR/> zip (<BR/> set(foo), <BR/> map (foo.count, set(foo)))<BR/> )<BR/><BR/>ah... this one is more pythonic:<BR/><BR/>bar = dict ((i,foo.count(i)) <BR/> for i in set(foo))<BR/><BR/>cheers,<BR/>sebastienAnonymousnoreply@blogger.com